Search CORE

134 research outputs found

Reducing the complexity of the register file in dynamic superscalar processors

Author: Balasubramonian Rajeev
Dwarkadas Sandhya
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2001
Field of study

Journal ArticleDynamic superscalar processors execute multiple instructions out-of-order by looking for independent operations within a large window. The number of physical registers within the processor has a direct impact on the size of this window as most in-flight instructions require a new physical register at dispatch. A large multi-ported register file helps improve the instruction-level parallelism (ILP), but may have a detrimental effect on clock speed, especially in future wire-limited technologies. In this paper, we propose a register file organization that reduces register file size and port requirements for a given amount of ILP. We use a two-level register file organization to reduce register file size requirements, and a banked organization to reduce port requirements. We demonstrate empirically that the resulting register file organizations have reduced latency and (in the case of the banked organization) energy requirements for similar instructions per cycle (IPC) performance and improved instructions per second (IPS) performance in comparison to a conventional monolithic register file. The choice of organization is dependent on design goals

The University of Utah: J. Willard Marriott Digital Library

Dynamically managing the communication-parallelism trade-off in future clustered processors

Author: Balasubramonian Rajeev
Dwarkadas Sandhya
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2003
Field of study

Journal ArticleClustered microarchitectures are an attractive alternative to large monolithic superscalar designs due to their potential for higher clock rates in the face of increasingly wire-delay-constrained process technologies. As increasing transistor counts allow an increase in the number of clusters, thereby allowing more aggressive use of instruction-level parallelism (ILP), the inter-cluster communication increases as data values get spread across a wider area. As a result of the emergence of this trade-off between communication and parallelism, a subset of the total on-chip clusters is optimal for performance. To match the hardware to the application's needs, we use a robust algorithm to dynamically tune the clustered architecture. The algorithm, which is based on program metrics gathered at periodic intervals, achieves an 11% performance improvement on average over the best statically defined architecture. We also show that the use of additional hardware and reconfiguration at basic block boundaries can achieve average improvements of 15%. Our results demonstrate that reconfiguration provides an effective solution to the communication and parallelism trade-off inherent in the communication-bound processors of the future

The University of Utah: J. Willard Marriott Digital Library

Dynamically allocating processor resources between nearby and distant ILP

Author: Balasubramonian Rajeev
Dwarkadas Sandhya
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2001
Field of study

Journal ArticleModern superscalar processors use wide instruction issue widths and out-of-order execution in order to increase instruction-level parallelism (ILP). Because instructions must be committed in order so as to guarantee precise exceptions, increasing ILP implies increasing the sizes of structures such as the register file, issue queue, and reorder buffer. Simultaneously, cycle time constraints limit the sizes of these structures, resulting in conflicting design requirements. In this paper, we present a novel microarchitecture designed to overcome the limitations of a register file size dictated by cycle time constraints. Available registers are dynamically allocated between the primary program thread and a future thread. The future thread executes instructions when the primary thread is limited by resource availability. The future thread is not constrained by in-order commit requirements. It is therefore able to examine a much larger instruction window and jump far ahead to execute ready instructions. Results are communicated back to the primary thread by warming up the register file, instruction cache, data cache, and instruction reuse buffer, and by resolving branch mispredicts early. The proposed microarchitecture is able to get an overall speedup of 1.17 over the base processor for our benchmark set, with speedups of up to 1.64

The University of Utah: J. Willard Marriott Digital Library

Hierarchical Parallelization of Gene Differential Association Analysis

Author: Dwarkadas Sandhya
Hu Rui
Needham Mark
Qiu Xing
Publication venue: BioMed Central
Publication date: 01/01/2011
Field of study

Abstract Background Microarray gene differential expression analysis is a widely used technique that deals with high dimensional data and is computationally intensive for permutation-based procedures. Microarray gene differential association analysis is even more computationally demanding and must take advantage of multicore computing technology, which is the driving force behind increasing compute power in recent years. In this paper, we present a two-layer hierarchical parallel implementation of gene differential association analysis. It takes advantage of both fine- and coarse-grain (with granularity defined by the frequency of communication) parallelism in order to effectively leverage the non-uniform nature of parallel processing available in the cutting-edge systems of today. Results Our results show that this hierarchical strategy matches data sharing behavior to the properties of the underlying hardware, thereby reducing the memory and bandwidth needs of the application. The resulting improved efficiency reduces computation time and allows the gene differential association analysis code to scale its execution with the number of processors. The code and biological data used in this study are downloadable from <url>http://www.urmc.rochester.edu/biostat/people/faculty/hu.cfm.</url> Conclusions The performance sweet spot occurs when using a number of threads per MPI process that allows the working sets of the corresponding MPI processes running on the multicore to fit within the machine cache. Hence, we suggest that practitioners follow this principle in selecting the appropriate number of MPI processes and threads within each MPI process for their cluster configurations. We believe that the principles of this hierarchical approach to parallelization can be utilized in the parallelization of other computationally demanding kernels.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Software DSM protocols that adapt between single writer and multiple writer

Author: Amza Cristiana
Cox Alan L.
Dwarkadas Sandhya
Zwaenepoel Willy
Publication venue
Publication date: 20/10/2005
Field of study

We present two software DSM protocols that dynamically adapt between a single writer (SW) and a multiple writer (MW) protocol based on the application's sharing patterns. The first protocol (WFS) adapts based on write-write false sharing; the second (WFS+WG) based on a combination of write-write false sharing and write granularity. The adaptation is automatic. No user or compiler information is needed. The choice between SW and MW is made on a per-page basis. We measured the performance of our adaptive protocols on an 8-node SPARC cluster connected by a 155 Mbps ATM network. We used eight applications, covering a broad spectrum in terms of write-write false sharing and write granularity. We compare our adaptive protocols against the MW-only and the SW-only approach. Adaptation to write-write false sharing proves to be the critical performance factor, while adaptation to write granularity plays only a secondary role in our environment and for the applications considered. Each of the two adaptive protocols matches or exceeds the performance of the best of MW and SW in seven out of the eight application

Infoscience - École polytechnique fédérale de Lausanne

Evaluating the performance of software distributed shared memory as a target for parallelizing compilers

Author: Cox Alan L.
Dwarkadas Sandhya
Lu Honghui
Zwaenepoel Willy
Publication venue
Publication date: 20/10/2005
Field of study

In this paper we evaluate the use of software distributed shared memory (DSM) on a message passing machine as the target for a parallelizing compiler. We compare this approach to compiler-generated message passing, hand-coded software DSM and hand-coded message passing. For this comparison, we use six applications: four that are regular and two that are irregular: Our results are gathered on an 8-node IBM SP/2 using the TreadMarks software DSM system. We use the APR shared-memory (SPF) compiler to generate the shared memory-programs and the APR XHPF compiler to generate message passing programs. The hand-coded message passing programs run with the IBM PVMe optimized message passing library. On the regular programs, both the compiler-generated and the hand-coded message passing outperform the SPF/TreadMarks combination: the compiler-generated message passing by 5.5% to 40%, and the hand-coded message passing by 7.5% to 49%. On the irregular programs, the SPF/TreadMarks combination outperforms the compiler-generated message passing by 38% and 89%, and only slightly underperforms the hand-coded message passing, differing by 4.4% and 16%. We also identify the factors that account for the performance differences, estimate their relative importance, and describe methods to improve the performanc

Infoscience - École polytechnique fédérale de Lausanne

Managing Static Leakage Energy in Microprocessor Functional Units

Author: Albonesi David H.
Dropsho Steven
Dwarkadas Sandhya
Friedman Eby G.
Kursun Volkan
Publication venue
Publication date: 30/11/2006
Field of study

Static energy due to subthreshold leakage current is projected to become a major component of the total energy in high performance microprocessors. Many studies so far have examined and proposed techniques to reduce leakage in on-chip storage structures. In this study, static energy is reduced in the integer functional units by leveraging the unique qualities of dual threshold voltage domino logic. Domino logic has desirable properties that greatly reduce leakage current while providing fast propagation times. However, due to the energy cost of entering the low leakage current state (sleep mode), domino logic has thus far been used only for leakage reduction in the longterm standby mode. We examine the utility of the sleep mode (while considering the aforementioned costs) when idle times are relatively short, one to a few hundred cycles, as is often the case for functional units. Using an analytical energy model suitable for architecture-level analysis, we explore the interaction of the application and technology, and the effect on energy and performance as the underlying parameters are varied, on a set of benchmarks. Our results show that if the leakage approaches the magnitude as projected in the literature, even for short idle intervals as few as ten cycles, an aggressive policy of activating the sleep mode at every idle period performs well and a more complex control strategy may not be warranted. We also propose a simple design, called Gradual Sleep, to reduce the energy impact of using the sleep mode for smaller idle periods

Infoscience - École polytechnique fédérale de Lausanne

Implementation tradeoffs in the design of flexible transactional memory support

Author: Arrvindh Shriraman
Bloom
Blundell
Fraser
Larus
Michael L. Scott
Sandhya Dwarkadas
Publication venue: 'Elsevier BV'
Publication date
Field of study

Crossref

Software vs. Hardware Shared Memory Implementation: A Case Study

Author: Cox Alan L.
Dwarkadas Sandhya
Keleher Pete
Lu Honghui
Rajamony Ramakrishnan
Zwaenepoel Willy
Publication venue
Publication date: 20/10/2005
Field of study

We compare the performance of software-supported shared memory on a general-purpose network to hardware-supported shared memory on a dedicated interconnect. Up to eight processors, our results are based on the execution of a set of application programs on a SGI 4D/480 multiprocessor and on TreadMarks, a distributed shared memory system that runs on a Fore ATM LAN of DECstation-5000/240s. Since the DEC-station and the 4D/480 use the same processor, primary cache, and compiler, the shared-memory implementation is the principal difference between the systems. Our results show that TreadMarks performs comparably to the 4D/480 for applications with moderate amounts of synchronization, but the difference in performance grows as the synchronization frequency increases. For applications that require a large amount of memory bandwidth, TreadMarks can perform better than the SGI 4D/480. Beyond eight processors, our results are based on execution-driven simulation. Specifically, we compare a software implementation on a general-purpose network of uniprocessor nodes, a hardware implementation using a directory-based protocol on a dedicated interconnect, and a combined implementation using software to provide shared memory between multiprocessor nodes with hardware implementing shared memory within a node. For the modest size of the problems that we can simulate, the hardware implementation scales well and the software implementation scales poorly. The combined approach delivers performance close to that of the hardware implementation for applications with small to moderate synchronization rates and good locality. Reductions in communication overhead improve the performance of the software and the combined approach, but synchronization remains a bottleneck

Infoscience - École polytechnique fédérale de Lausanne